fix(qwen-asr): enable timestamp output when forced_aligner is configured by fqscfqj · Pull Request #10013 · mudler/LocalAI

fqscfqj · 2026-05-26T08:33:46Z

Problem

The qwen-asr backend loads the forced_aligner model correctly but never actually produces timestamps. All segments return start=0, end=0.

Two bugs cause this:

Bug 1: `return_time_stamps` not passed to `transcribe()`

Qwen3ASRModel.transcribe() defaults return_time_stamps=False. The backend never passes True, so the forced aligner is loaded but silently skipped during inference.

Bug 2: Timestamp item format mismatch

The parsing code checks isinstance(ts, (list, tuple)), but qwen_asr returns ForcedAlignItem dataclass instances with .text, .start_time, .end_time attributes — not tuples. The check always fails, so timestamps are zeroed out even if Bug 1 were fixed.

Fix

- Pass return_time_stamps=True to transcribe() when a forced_aligner is loaded.
- Add hasattr() check for ForcedAlignItem dataclass before falling back to tuple parsing.

Testing

Verified against qwen3-asr-0.6b with Qwen/Qwen3-ForcedAligner-0.6B — timestamps now return correctly in verbose_json, srt, and vtt formats.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for returning and parsing word/segment timestamps when the Qwen ASR forced aligner is available.

Changes:

- Detects presence of a forced aligner and requests timestamps from model.transcribe.
- Adds parsing for forced-aligner timestamp objects (start_time, end_time, text) in addition to tuple/list timestamps.


+            results = self.model.transcribe(
+                audio=audio_path, language=language, context=context,
+                return_time_stamps=has_aligner,
+            )


+                    if hasattr(ts, 'start_time') and hasattr(ts, 'end_time') and hasattr(ts, 'text'):
+                        # ForcedAlignItem dataclass (from qwen_asr forced aligner)
+                        start_ms = int(ts.start_time * 1000) if ts.start_time is not None else 0
+                        end_ms = int(ts.end_time * 1000) if ts.end_time is not None else 0
+                        seg_text = ts.text or ""
+                    elif isinstance(ts, (list, tuple)) and len(ts) >= 3:


Two bugs prevented timestamps from working in the qwen-asr backend:

1. transcribe() was called without return_time_stamps=True, so the forced aligner was loaded but never invoked. Now we pass return_time_stamps=True when a forced_aligner is present.
2. The timestamp parsing code expected (list, tuple) items, but the qwen_asr library returns ForcedAlignItem dataclass instances with .text, .start_time, .end_time attributes. Added hasattr() check to handle this correctly, falling back to tuple parsing for backward compatibility.

- Wrap return_time_stamps kwarg in try/except TypeError for safety
- Add defensive float() normalization for timestamp times
- Use str() for text extraction to ensure string type
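The three hardening steps listed above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the helper names `transcribe_with_optional_timestamps` and `parse_item`, the stand-in `FakeAlignItem` class, and the `(start, end, text)` order assumed for legacy tuples are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class FakeAlignItem:
    """Hypothetical stand-in for qwen_asr's ForcedAlignItem (same attribute names)."""
    text: Any
    start_time: Any
    end_time: Any


def transcribe_with_optional_timestamps(model, has_aligner: bool, **kwargs):
    """Request timestamps when an aligner is loaded; retry without the kwarg on TypeError."""
    if has_aligner:
        try:
            return model.transcribe(return_time_stamps=True, **kwargs)
        except TypeError:
            # Assumed cause: a transcribe() signature that lacks return_time_stamps.
            pass
    return model.transcribe(**kwargs)


def parse_item(ts) -> Optional[Tuple[float, float, str]]:
    """Normalize one timestamp item to (start_seconds, end_seconds, text)."""
    if hasattr(ts, "start_time") and hasattr(ts, "end_time") and hasattr(ts, "text"):
        start, end, text = ts.start_time, ts.end_time, ts.text
    elif isinstance(ts, (list, tuple)) and len(ts) >= 3:
        # Assumption: legacy tuples are ordered (start, end, text).
        start, end, text = ts[0], ts[1], ts[2]
    else:
        return None
    return (
        float(start) if start is not None else 0.0,
        float(end) if end is not None else 0.0,
        str(text) if text is not None else "",
    )
```

One trade-off of this pattern: a TypeError raised inside transcribe() for an unrelated reason also triggers the retry, so the fallback can hide a real error.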

The Go server reads TranscriptSegment.start/end via time.Duration, which is in nanoseconds. Previously the backend sent milliseconds (* 1000), causing timestamps to be 1,000,000x too small (e.g. 8e-8 instead of 0.08). Convert seconds → nanoseconds (* 1e9) instead. Also applies to the legacy tuple path for consistency.

fqscfqj · 2026-05-26T09:58:27Z

Additional fix: seconds → nanoseconds

While testing this PR against a real deployment, I discovered a third issue beyond the two bugs described above:

Bug 3: Timestamp unit mismatch between Python backend and Go server

The Go server reads TranscriptSegment.start/end (int64) and wraps them in time.Duration:

// core/backend/transcript.go
segments = append(segments, &schema.TranscriptSegment{
    Start: time.Duration(s.Start),
    End:   time.Duration(s.End),
})

Go's time.Duration counts nanoseconds, but the backend was sending milliseconds (* 1000). A millisecond count read as nanoseconds is 1,000,000x too small, e.g. 8e-8 seconds instead of 0.08 seconds.

Fix: Convert seconds → nanoseconds (* 1_000_000_000) instead of seconds → milliseconds (* 1000).
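As a rough check of the arithmetic, the conversion can be sketched like this. The helper name `seconds_to_nanos` is hypothetical, and rounding before truncation is an illustrative choice rather than something confirmed by the PR.

```python
NANOS_PER_SECOND = 1_000_000_000


def seconds_to_nanos(seconds) -> int:
    """Convert a float number of seconds to an integer nanosecond count."""
    if seconds is None:
        return 0
    # round() guards against float results such as 239999999.99999997.
    return int(round(float(seconds) * NANOS_PER_SECOND))


# Fixed behaviour: 0.08 s is sent as 80_000_000 ns.
start_ns = seconds_to_nanos(0.08)

# Old behaviour: 0.08 s was sent as 80 (milliseconds), which a
# nanosecond-based reader interprets as 80 ns, i.e. 8e-8 s.
old_value = int(round(0.08 * 1000))
old_seconds_seen_by_reader = old_value / NANOS_PER_SECOND
```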

Verified output

After applying all three fixes, verbose_json, srt, and vtt formats all produce correct timestamps:

{"segments": [{"id": 0, "start": 0.08, "end": 0.24, "text": "今"}, ...]}

1
00:00:00,080 --> 00:00:00,240
今

Pushed the additional commit to the PR branch.

Read request.timestamp_granularities from the gRPC request.

- 'word': return one segment per aligned item (character / word)
- 'segment' (default): merge consecutive items at sentence boundaries

Sentence boundaries detected via CJK punctuation (。！？；…) and Latin endings (. ! ? ;). This matches the OpenAI Whisper API contract where omitting the parameter defaults to segment-level.

Unicode curly quotes (U+2018/2019) were being interpreted as Python string delimiters, causing SyntaxError. Use explicit unicode escapes.
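A small sketch of the escape approach, assuming a punctuation set along these lines. The constant name `QUOTE_CHARS`, its exact members, and the helper `strip_quotes` are illustrative, not taken from the backend.

```python
# Curly quotes written as escapes, so the source file stays ASCII-only here
# and no editor or copy-paste step can turn them into something else.
QUOTE_CHARS = "\u2018\u2019\u201c\u201d"  # left/right single, left/right double


def strip_quotes(text: str) -> str:
    """Remove curly quotes from text (hypothetical helper for illustration)."""
    return "".join(ch for ch in text if ch not in QUOTE_CHARS)
```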

The forced aligner strips punctuation from its output, so text-based sentence detection doesn't work. Instead, detect segment boundaries by measuring time gaps between consecutive aligned items. Threshold = max(median_gap * 4, 0.3s). This cleanly separates intra-sentence gaps (< 0.24s) from inter-sentence gaps (> 0.3s) across Chinese, English, and other languages.
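The gap-based rule can be sketched as below. Only the threshold formula max(median_gap * 4, 0.3s) comes from the commit message. The function name `split_by_gaps`, the definition of a gap as next start minus current end, and the strict `>` comparison are assumptions.

```python
from statistics import median
from typing import List, Tuple

Item = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def split_by_gaps(items: List[Item], min_threshold: float = 0.3,
                  factor: float = 4.0) -> List[List[Item]]:
    """Group aligned items into segments, splitting where the pause is long."""
    if not items:
        return []
    # Assumed gap definition: silence between one item's end and the next start.
    gaps = [max(0.0, nxt[0] - cur[1]) for cur, nxt in zip(items, items[1:])]
    threshold = max(median(gaps) * factor, min_threshold) if gaps else min_threshold
    groups: List[List[Item]] = [[items[0]]]
    for gap, item in zip(gaps, items[1:]):
        if gap > threshold:
            groups.append([item])    # long pause: start a new segment
        else:
            groups[-1].append(item)  # short pause: same segment
    return groups
```

With items ending at 0.4 s and resuming at 1.0 s, the 0.6 s pause exceeds the 0.3 s floor and starts a new segment, while pauses of a few hundredths of a second stay inside one.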

The forced aligner strips whitespace from tokenized text, so English words like ['hello', 'world'] were joined as 'helloworld'. Add _smart_join() that inserts spaces between non-CJK tokens while keeping CJK characters and punctuation unspaced. Works for Chinese, English, Korean, Japanese, and mixed-language text.
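A joiner of this kind might look roughly like the sketch below. It is not the PR's _smart_join(): the name `smart_join_sketch`, the code-point ranges treated as CJK, and the choice to treat Hangul tokens as space-separated words are all assumptions.

```python
import unicodedata
from typing import Iterable


def _is_cjk(ch: str) -> bool:
    """True for Han ideographs, kana, and CJK/fullwidth punctuation (assumed ranges)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0x3000 <= cp <= 0x303F   # CJK symbols and punctuation
        or 0xFF00 <= cp <= 0xFFEF   # Halfwidth and fullwidth forms
    )


def _is_punct(ch: str) -> bool:
    return unicodedata.category(ch).startswith("P")


def smart_join_sketch(tokens: Iterable[str]) -> str:
    """Join tokens, inserting a space only between two non-CJK, non-punctuation edges."""
    out = ""
    for tok in tokens:
        if not tok:
            continue
        if out:
            prev, cur = out[-1], tok[0]
            if not (_is_cjk(prev) or _is_cjk(cur) or _is_punct(cur)):
                out += " "
        out += tok
    return out
```

For example, `smart_join_sketch(["hello", "world"])` yields "hello world", while `smart_join_sketch(["今", "天"])` yields "今天". Opening punctuation such as "(" is not special-cased in this sketch.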

Copilot AI review requested due to automatic review settings May 26, 2026 08:33

Copilot AI reviewed May 26, 2026


fqscfqj mentioned this pull request May 26, 2026

bug: ASR backends silently fail on non-string upstream return types (nemo TDT/RNNT, qwen-asr timestamps) #10014

Open

fqscfqj added 2 commits May 26, 2026 08:51

refactor: address Copilot review for qwen-asr timestamps

346c5d2


fqscfqj force-pushed the fix/qwen-asr-timestamps branch from 4f283ba to 346c5d2 Compare May 26, 2026 08:51

fqscfqj added 4 commits May 26, 2026 10:05

fix(qwen-asr): escape smart quotes in punctuation set

dd4e86b


mudler enabled auto-merge (squash) May 26, 2026 20:09

mudler approved these changes May 26, 2026


mudler merged commit 4e5ec6f into mudler:master May 26, 2026
60 checks passed

BrewTestBot mentioned this pull request May 27, 2026

localai 4.3.2 Homebrew/homebrew-core#285003

Merged

mudler merged 7 commits into mudler:master from fqscfqj:fix/qwen-asr-timestamps
